The problem of contextual advertising
All contextual ad vendors claim that their product is "semantic", in the sense that the matching algorithm is a bit smarter than keyword matching. Yet these products are generally not based on the semantic web and RDF, where most of the concepts we think about can be mapped to precisely defined concepts from sources like DBpedia and Freebase.
Some kind of 'semantic' capability is required to deal with two issues: (i) a given word can mean different things in different contexts, and (ii) people can use different words to say the same thing.
For a web site that's based on textual content, it's possible to tackle the above two problems by taking advantage of statistical regularities in text. For instance, the word jaguar could refer to an animal, a car, or several other things. If the document also contains words like transmission, steering, and engine, the document is likely to be about a car. A document that contains words like habitat, fur and prey is likely to be about a cat.
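The co-occurrence intuition above can be sketched as a toy disambiguator. The sense inventory and cue-word lists here are illustrative assumptions, not drawn from any real lexicon:

```python
# Toy word-sense disambiguation by counting cue-word overlap.
# The senses and cue words below are made-up illustrations.
SENSES = {
    "car": {"transmission", "steering", "engine", "brakes"},
    "animal": {"habitat", "fur", "prey", "jungle"},
}

def disambiguate(context_words):
    """Pick the sense whose cue words overlap most with the document's words."""
    context = set(context_words)
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate(["the", "jaguar", "needs", "a", "new", "transmission", "engine"]))   # car
print(disambiguate(["the", "jaguar", "stalked", "its", "prey", "in", "its", "habitat"]))  # animal
```

Real systems replace the hand-built cue lists with statistics learned from large corpora, but the matching principle is the same.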
Latent Dirichlet Allocation and friends
Many systems that take advantage of this kind of knowledge represent it implicitly rather than explicitly.
Back in the 1980s, a technique known as Latent Semantic Analysis (LSA, also called Latent Semantic Indexing, or LSI) was developed. It got a lot of academic attention but limited commercial use, because the method was patented and because the Singular Value Decomposition it depends on scales as the cube of the matrix dimensions.
A modern take on this idea is Latent Dirichlet Allocation which imagines that documents are composed out of topics, each of which has a specific distribution of words. Like LSI, this compresses a vocabulary space of hundreds of thousands of words into a much smaller space of, say, a few hundred topics. Highly scalable algorithms are known for LDA, and LDA has a solid intellectual footing which makes it possible to extend and improve on LDA in numerous ways.
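To make the generative picture concrete, here is a toy collapsed Gibbs sampler for LDA in pure Python. This is a sketch for intuition only, assuming documents arrive as token lists; it is not one of the scalable algorithms mentioned above:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # per-doc topic counts
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # per-topic word counts
    topic_total = [0] * num_topics                              # tokens in each topic
    assignments = []
    for d, doc in enumerate(docs):                              # random initialization
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):                                      # resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                           # remove old assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + beta * vocab_size)
                           for k in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z                           # record new assignment
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

docs = [["engine", "steering", "engine", "transmission"],
        ["fur", "prey", "habitat", "fur"]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, iters=50)
print(doc_topic)  # topic counts per document; each row sums to the doc's length
```

Production systems use far more efficient inference (for example, online variational Bayes), but the counts being updated here are the same quantities those algorithms estimate.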
LSI and LDA are just two algorithms for extracting the "essence" of long documents, and most contextual advertising systems use something similar to them.
What if you don't have enough text?
For a site like Ookaboo, however, which is primarily visual, there is little text on the page. The important keywords are present, but so is a lot of off-topic boilerplate, so I've found that contextual advertising systems can have a hard time matching relevant ads.
When analyzing or searching over microposts, such as tweets and SMS messages, we encounter a similar problem. With a tiny amount of text, the meaning of every word counts, and we can't count on large numbers to make a weak algorithm work.
In the case of Ookaboo, however, we've got an unfair advantage. Ookaboo is a photo collection that was built around Linked Data concepts. Each image in Ookaboo is indexed against concepts mined from DBpedia and Freebase, such as :Bamboo or :Ventura_Boulevard. Since we already know what the topic is, we don't need to analyze or understand the text at all. Converting concepts to text is like scrambling an egg: structure gets destroyed that is hard to recover. By never scrambling the egg in the first place, we can reason about the concepts directly.
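With concept-indexed content, matching becomes a direct join on concept identifiers rather than text analysis. A minimal sketch, with hypothetical image ids and a plain dict standing in for the real index:

```python
# Each image is indexed against Linked Data concept URIs (hypothetical data).
image_concepts = {
    "img-001": ["http://dbpedia.org/resource/Bamboo"],
    "img-002": ["http://dbpedia.org/resource/Ventura_Boulevard"],
}

# Ads are keyed by the same URIs, so matching is lookup, not NLP.
ads_by_concept = {
    "http://dbpedia.org/resource/Bamboo": ["ad-gardening-books"],
}

def ads_for_image(image_id):
    """Join the image's concepts directly against the ad inventory."""
    return [ad
            for concept in image_concepts.get(image_id, [])
            for ad in ads_by_concept.get(concept, [])]

print(ads_for_image("img-001"))  # ['ad-gardening-books']
```

Because both sides share DBpedia/Freebase identifiers, the ambiguity problems that plague text matching never arise.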
Minimum viable product
To trial this idea on Ookaboo, I decided to sell books from Amazon.com that are relevant to each page's topics. This made sense for two reasons: first, in principle we can find books relevant to any topic, and second, the Amazon API is easy to work with.
An ad spot is requested with the topic's Ookaboo id. The ad server looks up books that match the topic, checks pricing and availability from Amazon, and produces HTML to fill the ad spot. Unlike an ad served as an image or in an <IFRAME>, the vertical size of the ad isn't fixed, so I don't need to worry about exactly how many pixels are consumed by the images and the text. The system can also decide to show more or fewer books (or none at all) if it chooses.
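That serving flow can be sketched in a few lines. The function and field names here are hypothetical, and the availability check stands in for a real call to Amazon's Product Advertising API:

```python
# Sketch of the ad server's request flow (all names are hypothetical).
def serve_ad(topic_id, catalog, check_amazon):
    """topic id -> matching books -> availability check -> HTML fragment."""
    books = catalog.get(topic_id, [])                   # books matched to the topic
    available = [b for b in books if check_amazon(b)]   # price/stock check per book
    if not available:
        return ""                                       # show nothing over a bad ad
    items = "".join(
        '<li><a href="%s">%s</a></li>' % (b["url"], b["title"])
        for b in available
    )
    return "<ul>%s</ul>" % items  # variable-height fragment, no fixed pixel size

catalog = {42: [{"title": "Field Guide to Big Cats", "url": "http://example.com/b1"}]}
print(serve_ad(42, catalog, check_amazon=lambda book: True))
```

Returning an empty fragment when nothing is available is what lets the ad spot gracefully collapse instead of showing an irrelevant ad.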
Users perceive the books as relevant: the click-through rate is perhaps 10 times higher than other ads I've tried. The revenue numbers aren't 10 times better, though, because I only get paid when somebody makes a purchase, and Ookaboo users aren't exactly burning with intent to buy books. Yet, unlike the creepy pair of shoes from Zappos that follows you around the web, these ads legitimize rather than delegitimize the site, because they are relevant without being privacy-invading.
Clearly this concept can be developed in other directions. Many concepts are themselves products that can be sold, and by following semantic relationships we can discover competing and complementary products. If somebody is looking at a grand hotel, we could try to book them a room there. If somebody is looking at a tourist attraction, we can use spatial reasoning to find them travel and accommodation offers nearby. It's clear that PPC and display ads could be matched this way as well.
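The spatial case reduces to a distance query over concept coordinates (DBpedia exposes latitude/longitude for many places). A minimal haversine sketch, with hypothetical offers and the Louvre's approximate coordinates as the attraction:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical offers, with coordinates as they might come from a linked-data source.
offers = [
    ("Hotel A", 48.8606, 2.3376),   # right at the attraction
    ("Hotel B", 48.9000, 2.5000),   # roughly 13 km away
]

def offers_near(lat, lon, max_km):
    """Return offers within max_km of the given point."""
    return [name for name, olat, olon in offers
            if haversine_km(lat, lon, olat, olon) <= max_km]

print(offers_near(48.8606, 2.3376, max_km=5))  # ['Hotel A']
```

A production system would push this filter into a spatially indexed store rather than scanning a list, but the reasoning step is the same.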
If you're developing a site that is based on linked data, I'd love to talk with you about adding your site to my network. With the work I'm doing on Infovore and :BaseKB I can easily match up products with DBpedia and Freebase, and also match up DBpedia and Freebase to other databases. Please email me today.